Toward an Understanding of Situational Judgment Item Validity

Authors

  • Michael A. McDaniel
  • Peter J. Legree
Abstract

Consensually scored situational judgment tests using Likert scale response formats can be substantially improved with respect to validity, Black-White mean differences, resistance to faking, and test length. This improvement is achieved with two simple adjustments. The first adjustment is controlling for elevation and scatter (Cronbach & Gleser, 1953). This adjustment substantially improves item validity. Also, because there is a race difference in the preference for extreme responses on Likert scales (Bachman & O’Malley, 1984), these adjustments substantially reduce Black-White mean score differences. In addition, these adjustments eliminate the score elevation associated with the faking strategy of avoiding extreme responses (Cullen, Sackett, & Lievens, 2006). Item validity is shown to have a U-shaped relationship with item means. The second adjustment is to drop items with mid-range item means. This permits the scale to be shortened dramatically without harming validity.

Situational judgment tests (SJTs) present job applicants with written or video-based problem scenarios and a set of possible response options. Job applicants evaluate the effectiveness of the responses for addressing the problem described in the scenario. Although SJTs have been used in personnel selection for about 80 years (McDaniel, Morgeson, Finnegan, Campion & Braverman, 2001; Moss, 1926), and have been the subject of substantial research in the last two decades (McDaniel, Hartman, Whetzel & Grubb, 2007; Motowidlo, Dunnette & Carter, 1990; Weekley & Ployhart, 2006), there is very little research addressing how best to build and score SJTs (Schmitt & Chan, 2006; Weekley, Ployhart & Holtz, 2006). There is also little knowledge concerning the best approaches to build scales using SJT items to tap specific constructs.
In the absence of this knowledge, many approaches have evolved for developing and scoring SJTs (Weekley, Ployhart & Holtz, 2006; Bergman, Drasgow, Donovan, Henning & Juraska, 2006), and the effectiveness of these methods for maximizing criterion-related and construct validity is largely unknown. Unlike cognitive ability or job knowledge tests, response options in SJTs cannot easily be declared correct or incorrect. As such, items are typically scored using some form of consensus judgment (Legree, Psotka, Tremble, & Bourne, 2005). Expert judges are often asked to reach consensus concerning which responses are preferred (Weekley et al., 2006). Consensus may also be based on the responses of applicants, incumbents, or supervisors of incumbents. In such applications, the means of the respondents are considered the correct response.

There are several different response formats. When response instructions request that a respondent pick a single response (identify the behavior that you would most [or least] likely do, identify the most effective [or ineffective] response), the means are used to identify the response judged to be correct. Another format involves asking respondents to rate each response option using a Likert scale. Using this format, an applicant’s score is often expressed as a deviation or a squared deviation from the mean. Alternatively, the mean is used to determine whether the item response is judged effective or ineffective, and the item is scored dichotomously (McDaniel, Whetzel & Nguyen, 2006).

Consensual scoring is a form of profile matching. One profile consists of the means of the items collected from some group (e.g., experts, applicants, incumbents). The other profile is the item responses of one respondent. A respondent’s score on the SJT is a function of the degree of match between the respondent’s answers and the group means. Cronbach and Gleser (1953) conceptualized profile matching with respect to elevation, scatter, and shape.
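As a concrete illustration, the two consensual scoring approaches described above (squared deviation from the group means, and dichotomous keying against the scale midpoint) might be sketched as follows. The ratings, the use of the respondents themselves as the reference group, the negation of the squared deviation, and the midpoint value are all hypothetical illustrative choices, not prescriptions from the paper.

```python
import numpy as np

# Hypothetical data: 3 respondents rating 4 response options on a 7-point scale.
ratings = np.array([
    [6, 2, 5, 1],
    [7, 1, 6, 2],
    [5, 3, 5, 2],
], dtype=float)

# Consensus profile: item means across the reference group
# (here the respondents themselves; in practice often experts or incumbents).
consensus = ratings.mean(axis=0)

# Squared-deviation consensus score: negative mean squared deviation from
# the consensus, so that a closer match yields a higher score.
raw_scores = -((ratings - consensus) ** 2).mean(axis=1)

# Dichotomous scoring: the consensus mean classifies each option as
# effective (above the scale midpoint) or ineffective; a respondent earns
# credit for each option rated on the same side of the midpoint.
midpoint = 4.0
keyed_effective = consensus > midpoint
dichotomous_scores = ((ratings > midpoint) == keyed_effective).mean(axis=1)

print(raw_scores)
print(dichotomous_scores)
```

Note that the first respondent, whose profile sits closest to the consensus, receives the least negative squared-deviation score, while all three agree with the dichotomous key here because they place every option on the same side of the midpoint as the consensus.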
Elevation is the mean of the items across a respondent. Scatter reflects the magnitude of a respondent’s score deviations from the respondent’s own mean. If one standardizes scores using a within-person z transformation, all respondents would have the same mean (zero) and the same standard deviation (one) across items. This transformation removes information from the scores related to elevation and scatter because all respondents have identical elevation and scatter. The remaining score information in the within-person standardized scores is called shape. Cronbach and Gleser argued that the investigator should consider whether elevation and scatter are important in their profile matching application.

For SJTs, we suggest that elevation and scatter reflect response tendencies such as a preference for using one end of the Likert rating scale over another (e.g., rating most responses as effective, or rating most responses as ineffective) or preferences for extreme or more mid-scale Likert ratings (e.g., on a nine-point Likert scale, preferring ratings of one and nine over ratings of three and seven). We assert that these response tendencies are primarily criterion-irrelevant noise in the ratings that damages SJT item validity. Thus, we offer the following hypothesis:

Hypothesis 1: SJT scoring methods that control for elevation and scatter will yield higher response option validities than methods that do not.

Although seldom considered in the I/O and management literatures, there are Black-White differences in the use of Likert scales (Bachman & O’Malley, 1984). Specifically, Blacks tend to use extreme rating points (e.g., 1 and 7 on a 7-point scale) with greater frequency than Whites on average. Extreme rating points, on average, will have larger deviations from the consensual mean, resulting in less favorable scores. The tendency of Blacks to use extreme ratings more than Whites will tend to increase Black-White differences.
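The within-person z transformation described above can be sketched as follows; the two respondent profiles are hypothetical and chosen only to contrast an extreme responder with a mid-scale responder.

```python
import numpy as np

# Hypothetical ratings: each row is one respondent's Likert ratings
# across four response options on a 7-point scale.
ratings = np.array([
    [7, 1, 6, 2],   # extreme responder
    [5, 3, 4, 3],   # mid-scale responder with a similar rank order
], dtype=float)

elevation = ratings.mean(axis=1, keepdims=True)   # per-person mean
scatter = ratings.std(axis=1, keepdims=True)      # per-person spread

# Within-person z transformation: every respondent now has mean 0 and
# standard deviation 1 across items, so only profile shape remains.
z = (ratings - elevation) / scatter

print(np.round(z, 3))
```

After the transformation, both respondents have identical elevation (zero) and scatter (one), so any remaining differences between their rows reflect shape alone.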
Controlling for elevation and scatter adjusts for individual differences in extreme responding. Thus, we offer the following hypothesis:

Hypothesis 2: SJT scoring methods that control for elevation and scatter will yield lower Black-White mean differences than methods that do not.

Cullen, Sackett, and Lievens (2006) examined the coachability of two SJTs. In addition to evaluating a curriculum focused on strategies for improving scores, they simulated what would happen if a respondent were coached not to endorse extreme values. This simulation was done by changing the 1 and 2 responses to a 3 and by changing the 6 and 7 responses to a 5. They discovered that scores could be improved by 1.57 standard deviations if examinees did not endorse extreme answers (e.g., 1 or 2 and 6 or 7). In this paper, we refer to this strategy as avoiding extreme responses. Controlling for elevation and scatter adjusts for individual differences in extreme responding. Thus, we offer the following hypothesis:

Hypothesis 3: SJT scoring methods that control for elevation and scatter will reduce score elevation associated with a faking strategy of avoiding extreme responses.

Although intended to be a faking strategy, respondents who employ the avoiding extreme responses strategy should have reduced Black-White differences in extreme responding. When ratings have been adjusted for elevation and scatter, the mean Black-White difference in extreme responding should be reduced or removed, and thus the use of the avoiding extreme responses strategy should have little impact on Black-White differences in SJT scale scores. However, for consensually scored SJTs using raw (i.e., unadjusted) Likert ratings, the data are expected to show Black-White differences in extreme responding, and the avoiding extreme responses strategy should reduce the Black-White mean differences in the SJT scale score. Thus, we offer the following hypothesis:

Hypothesis 4: SJTs with consensus scoring based on raw (i.e., unadjusted) Likert ratings will show smaller Black-White differences in SJT scale scores for those tests completed with an avoiding extreme responses strategy.

For a seven-point rating scale such as described in the Cullen et al. (2006) study, respondents who follow the avoiding extreme responses strategy would respond using only three rating points: 3, 4, and 5. Although intended as a faking strategy, it could also be a scale construction strategy. That is, the researcher could recode more extreme ratings to be more moderate responses. Such a scale construction method largely controls for elevation and scatter. Thus, we suggest that tests completed with this strategy will show large magnitude correlations with other methods that control for elevation and scatter (e.g., within-person z transformations and dichotomous scoring). It would also follow that tests completed with this faking strategy will have higher criterion-related validity than tests scored using raw (i.e., unadjusted) Likert scales. Thus, we offer the following hypotheses:

Hypothesis 5: SJT scale scores based on the avoiding extreme responses faking strategy will have large magnitude correlations with SJT scales that control for elevation and scatter.

Hypothesis 6: SJT raw consensus scale scores based on the avoiding extreme responses faking strategy will have larger criterion-related validity than SJT raw consensus scale scores based on raw (i.e., unadjusted) Likert scales.

Across respondents, the rated mean effectiveness on Likert scales varies across response options. Some response options have a mean indicating that the behavior, as rated by most respondents, is an effective solution to the problem described in the item stem, and other response options have means indicating that most respondents believe the behavior is ineffective. Response options also have variance.
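The recoding used in the Cullen et al. (2006) simulation of the avoiding extreme responses strategy can be sketched as follows; the respondent's rating vector here is hypothetical.

```python
import numpy as np

# Simulate the avoiding-extreme-responses strategy on a 7-point scale:
# recode 1 and 2 to 3, and 6 and 7 to 5, leaving 3-5 untouched.
def avoid_extremes(ratings):
    recoded = np.asarray(ratings, dtype=float).copy()
    recoded[recoded <= 2] = 3
    recoded[recoded >= 6] = 5
    return recoded

honest = np.array([1, 7, 4, 2, 6, 5])
coached = avoid_extremes(honest)
print(coached)   # [3. 5. 4. 3. 5. 5.]
```

The recoded responses fall only on the rating points 3, 4, and 5, which is why the same recoding, applied by a researcher at scoring time rather than by a coached examinee, largely controls for elevation and scatter.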
Some responses have low variance, indicating that most respondents rated the response option near its mean rating. Other response options have high variance, indicating that there was substantial disagreement among respondents concerning the effectiveness of the response option. This disagreement may reflect some ambiguity in the response option that requires the respondent to make inferences about the response option and/or the scenario. For example, if the situational judgment scenario concerns a miscommunication between a supervisor and a subordinate that has resulted in the subordinate feeling ill-treated, the response option “talk to your supervisor” is not informative concerning the content of the talk. Some might infer that the purpose of the talk is to resolve the miscommunication politely and judge this to be an effective behavior. Another might infer that the purpose of the talk is to express anger at the supervisor and judge this to be an ineffective behavior. When respondents disagree on the effectiveness of the response option, the variance of the ratings reflects this disagreement. We suspect that a response option with a larger than typical variance is more likely to have a mean near the midpoint of a Likert scale (e.g., near a 5 on a 9-point scale).

Item validity may be related to item means. We located two studies that examined the relationship between consensual means of experts and item validity. Both Waugh and Russell (2006) and Putka and Waugh (2007) reported U-shaped relationships between item means and item validity such that items with low or high means had the highest validity. We argue that response options with means near the mid-point of the Likert scale have less informational value than response options near either the low or high end of the Likert scale.
They might be less relevant to the scenario presented in the stem and thus provide little information on whether the respondent knows how to respond effectively. They might also be near the mid-point because the respondents show substantial variance (i.e., little agreement) on the effectiveness of the response option. Under either explanation, the response options carry less information and may be less valid. Thus, we offer the following hypothesis:

Hypothesis 7: There will be a U-shaped relationship between response option respondent means and item criterion-related validity such that items with low means and high means will be more valid than items with means near the mid-point of the Likert scale. This hypothesis applies to the raw consensus, standardized consensus, and dichotomous consensus scores.
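The second adjustment described in the abstract, dropping items with mid-range means, might be sketched as follows; the item means and the 1.5-point cutoff are hypothetical illustrations, not values recommended by the paper or the studies it cites.

```python
import numpy as np

# Hypothetical consensus means for seven response options on a 9-point scale.
item_means = np.array([1.8, 4.9, 8.2, 5.3, 2.4, 7.6, 5.0])

# The U-shaped validity finding suggests retaining options whose consensus
# mean sits well away from the scale midpoint (5 on a 9-point scale).
# The 1.5-point cutoff here is an arbitrary illustrative threshold.
midpoint, cutoff = 5.0, 1.5
keep = np.abs(item_means - midpoint) >= cutoff
print(item_means[keep])   # [1.8 8.2 2.4 7.6]
```

Under this rule, options with clearly effective or clearly ineffective consensus means survive, while mid-range options are dropped, which is how the scale can be shortened without harming validity if the U-shape holds.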


Similar articles

Toward an understanding of situational judgment item validity and group differences.

This paper evaluates 2 adjustments to common scoring approaches for situational judgment tests (SJTs). These adjustments can result in substantial improvements to item validity, reductions in mean racial differences, and resistance to coaching designed to improve scores. The first adjustment, applicable to SJTs that use Likert scales, controls for elevation and scatter (Cronbach & Gleser, 1953)...


Situational judgment tests in high-stakes settings: issues and strategies with generating alternate forms.

This study used principles underlying item generation theory to posit competing perspectives about which features of situational judgment tests might enhance or impede consistent measurement across repeat test administrations. This led to 3 alternate-form development approaches (random assignment, incident isomorphism, and item isomorphism). The effects of these approaches on alternate-form con...


Developing a situational judgment test blueprint for assessing the non-cognitive skills of applicants to the University of Utah School of Medicine, the United States

PURPOSE The situational judgment test (SJT) shows promise for assessing the non-cognitive skills of medical school applicants, but has only been used in Europe. Since the admissions processes and education levels of applicants to medical school are different in the United States and in Europe, it is necessary to obtain validity evidence of the SJT based on a sample of United States applicants. ...


Validity evidence for the situational judgment test paradigm in emotional intelligence measurement.

To date, various measurement approaches have been proposed to assess emotional intelligence (EI). Recently, two new EI tests have been developed based on the situational judgment test (SJT) paradigm: the Situational Test of Emotional Understanding (STEU) and the Situational Test of Emotion Management (STEM). Initial attempts have been made to examine the construct-related validity of these new ...




Publication date: 2009